Identification of pattern similarities by unsupervised cluster analysis

ABSTRACT

A method is provided for unsupervised clustering of data to identify pattern similarities. A clustering algorithm randomly divides the data into k different subsets and measures the similarity between pairs of datapoints within the subsets, assigning a score to the pairs based on similarity, with the greatest similarity giving the highest correlation score. A distribution of the scores is plotted for each k. The highest value of k that has a distribution that remains concentrated near the highest correlation score corresponds to the number of classes having pattern similarities.

RELATED APPLICATIONS

This application is a continuation of application Ser. No. 13/019,585,filed Feb. 2, 2011, issued Jul. 16, 2013 as U.S. Pat. No. 8,489,531,which is a continuation of application Ser. No. 11/929,522, filed Oct.30, 2007, issued Feb. 15, 2011 as U.S. Pat. No. 7,890,445, which is acontinuation of application Ser. No. 10/478,191, filed Nov. 18, 2003,now abandoned, which is a U.S. National Stage filing of PCT/US02/15666,filed May 17, 2002, which claims the benefit of U.S. provisional patentapplications No. 60/335,990, filed Nov. 30, 2001, and No. 60/292,221,filed May 18, 2001. Each of the identified applications is incorporatedherein by reference.

FIELD OF THE INVENTION

The present invention relates to the use of learning machines toidentify relevant patterns in datasets containing large quantities ofdiverse data, and more particularly to a method and system forunsupervised learning for determining an optimal number of data clustersinto which data can be divided to best enable identification of relevantpatterns.

BACKGROUND OF THE INVENTION

Knowledge discovery is the most desirable end product of datacollection. Recent advancements in database technology have lead to anexplosive growth in systems and methods for generating, collecting andstoring vast amounts of data. While database technology enablesefficient collection and storage of large data sets, the challenge offacilitating human comprehension of the information in this data isgrowing ever more difficult. With many existing techniques the problemhas become unapproachable. Thus, there remains a need for a newgeneration of automated knowledge discovery tools.

As a specific example, the Human Genome Project has completed sequencingof the human genome. The complete sequence contains a staggering amountof data, with approximately 31,500 genes in the whole genome. The amountof data relevant to the genome must then be multiplied when consideringcomparative and other analyses that are needed in order to make use ofthe sequence data. As an illustration, human chromosome 20 alonecomprises nearly 60 million base pairs. Several disease-causing geneshave been mapped to chromosome 20 including various autoimmune diseases,certain neurological diseases, type 2 diabetes, several forms of cancer,and more, such that considerable information can be associated with thissequence alone.

One of the more recent advances in determining the functioningparameters of biological systems is the analysis of correlation ofgenomic information with protein functioning to elucidate therelationship between gene expression, protein function and interaction,and disease states or progression. Proteomics is the study of the groupof proteins encoded and regulated by a genome. Genomic activation orexpression does not always mean direct changes in protein productionlevels or activity. Alternative processing of mRNA orpost-transcriptional or post-translational regulatory mechanisms maycause the activity of one gene to result in multiple proteins, all ofwhich are slightly different with different migration patterns andbiological activities. The human proteome is believed to be 50 to 100times larger than the human genome. Currently, there are no methods,systems or devices for adequately analyzing the data generated by suchbiological investigations into the genome and proteome.

Clustering is a widely used approach for exploratory data analysis.Clustering analysis is unsupervised learning—it is done withoutsuggestion from an external supervisor; classes and training examplesare not given a priori. The objective of clustering is to group datapoints into “meaningful” subsets. There is no agreed upon definition ofthe clustering problem, and various definitions appear in theliterature. For example, clustering has been defined as a search forsome “natural” or “inherent” grouping of the data. However, mostclustering algorithms do not address this problem. The vast majority ofclustering algorithms produce as their output either a dendogram or apartition into a number of clusters, where the number of clusters iseither the input, or there is some other parameter(s) that controls thenumber of clusters. In either case, a model selection technique isrequired in order to choose the model parameter, or in the case ofhierarchical algorithms, to determine which level of the dendogramrepresents the “inherent” structure of the data.

A few examples of applications of clustering include (1) analysis ofmicroarray data, where co-expressed genes are found, and the assumptionis that co-expression might be a sign of co-regulation; (2) in medicaldatasets (gene expression data, clinical data etc.), where patients aredivided into categories; (3) in any set or set of measurements to detecttrends or artifacts in the measurement protocol; and (4) in informationretrieval to partition text according to categories.

Most clustering algorithms either produce a hierarchical partitioning ofthe data into smaller and smaller clusters, or produces a partition of adataset into a number of clusters that depend on some input parameter(the number of clusters or some other parameter(s)). The questionremains, however, of how to set the input parameter, or how to determinewhich level of the tree representation of the data to look at:Clustering algorithms are unsophisticated in that they provide noinsight into the level of granularity at which the “meaningful” clustersmight be found. Occasionally, there may be prior knowledge about thedomain that facilitates making such a choice. However, even in suchcases, a method for determining the granularity at which to look at thedata is required. This is seen as the problem of finding the optimalnumber of clusters in the data, relative to some clustering algorithm.

E. Levine and E. Domany in “Resampling Method for UnsupervisedEstimation of Cluster Validity”, Neural Comp. 13, 2573-2593 (2001),assign a figure of merit to a clustering solution according to itssimilarity to clusterings of subsamples of the data. The “temperature”parameter of their clustering algorithm is selected according to amaximum of the similarity measure. However, in real data, such a maximumdoes not often occur. Other model selection techniques have difficultydetecting the absence of structure in the data, i.e., that there is asingle cluster. Further, many of algorithms make assumptions as tocluster shape, and do not perform well on real data, where the clustershape is generally not known. Accordingly, other methods for clusteringare needed. The present invention is directed to such a method.

SUMMARY OF THE INVENTION

In an exemplary embodiment, a system and method are provided foranalysis of data by grouping the input data points into meaningfulsubsets of data. The exemplary system comprises a storage device forstoring at least one data set, and a processor for executing theclustering algorithm. In the preferred embodiments, the clusteringalgorithm is a k-means or hierarchical clustering algorithm.

In the method of the present invention, a “good granularity level”,i.e., an optimal number of clusters, is defined as one at which aclustering solution is stable with respect to some perturbation of thedata, such as noise or sub-sampling. For each level of granularity, thealgorithm chooses a number of pairs of sub-samples or otherperturbations, clusters each sub-sample with the chosen level ofgranularity, then computes a similarity between pairs of clusteringsolutions. For each level of granularity, the distribution of thesimilarity between pairs of clustering solutions is computed, then thehighest level of granularity for which highly similar solutions areobtained under perturbation is chosen.

According to the present invention, the probability distribution of thesimilarity measure is evaluated in order to find the “true” number ofclusters based on a similarity measure between clustering solutions. Adot product between pairs of clustering solutions is normalized toprovide a similarity measure. Other similarity measures proposed in theliterature can also be expressed in terms of this dot product.

In one embodiment, the inventive method provides means for extraction ofinformation from gene expression profiles, in particular, extraction ofthe most relevant clusters of temporal expression profiles from tens ofthousands of measured profiles. In one variation of the presentembodiment, the clustering algorithm is the k-means algorithm. Otherembodiments include all pairwise clustering methods (e.g., hierarchicalclustering) and support vector clustering (i.e., a support vectormachine (SVM)). In one variation, the fit involves affinetransformations such as translation, rotation, and scale. Othertransformations could be included, such as elastic transformations. Inthe case of k-means, the algorithm is applied in a hierarchical mannerby sub-sampling the data and clustering the sub-samples. The resultingcluster centers for all the runs are then clustered again. The resultingcluster centers are considered to be the most significant profiles. Tofacilitate the clustering task, a ranking of the genes is firstperformed according to a given quality criterion combining saliency(significant difference in expression in the profile), smoothness, andreliability (low noise level). Other criteria for ranking include thelocal density of examples.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1 a-1 c are a plot of data which is a mixture of four Gaussians, ahistogram of the correlation score for the Gaussian data mixture, and aplot showing the overlay of cumulative distributions of the correlationscore, respectively.

FIGS. 2 a and 2 b are histogram of the correlation score for the yeastgene expression data and an overlay of the cumulative distributions ofthe correlation score, respectively.

FIGS. 3 a and 3 b are a histogram of the correlation score for 208points uniformly distributed on the unit cube and an overlay of thecumulative distributions of the correlation score, respectively.

FIGS. 4 a-4 c are a plot of the first two principle components, thehistogram of the correlation score, and an overlay of the cumulativedistributions of the correlation score; k=2 (rightmost) to k−10(leftmost), respectively.

FIGS. 5 a-5 b are graphic representations of gene pre-selection.

FIGS. 6 a-6 b are graphs of the average profiles of the clusters of theeight clustering runs and their grouping into meta-clusters.

FIGS. 7 a-7 d are graphs of curve fitting with affine transformations.

FIGS. 8 a-8 h are graphs of curve fitting with affine transformations,clustering obtained from the top 800 best quality genes.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A typical computer system for running the inventive clustering algorithmis a Pentium®-class processor with an interface for inputting thedataset(s) of interest and a memory for storing the dataset. Alternativeprocessors include configurable logic processors made up of SRAM-basedfield programmable gate arrays (FPGAs) or configurable system on chip(CSOC) architectures, which include a processor and an array ofconfigurable logic cells on a single chip, either of which can be usedto accelerate computer-intensive operations such as clusteringalgorithms on large datasets. In addition, parallel implementations inmulti-computers have been used for executing clustering algorithms forextracting knowledge from large-scale data repositories. Selection of anappropriate computer system for executing the inventive clusteringmethod is within the level of skill in the art.

The clustering model selection algorithm works with the help of ascoring function that provides a similarity measure between twolabelings.

Let X={x₁, . . . , x_(n)}, and x₁εR^(d) be the dataset to be clustered.A labeling

is a partition of X into k subsets S₁, . . . , S_(k). It can berepresented by a function c:X→{1, . . . , k} where c(x_(i)) is thecluster to which x_(i) belongs.

A less compact representation of a labeling, which may be useful is thefollowing representation by a matrix C with components:

$\begin{matrix}{C_{ij} = \{ \begin{matrix}1 & {{{if}\mspace{14mu} x_{i}\mspace{14mu} {and}\mspace{14mu} x_{j}\mspace{14mu} {belong}\mspace{14mu} {to}\mspace{14mu} {the}\mspace{14mu} {same}\mspace{14mu} {cluster}},} \\0 & {otherwise}\end{matrix} } & (1)\end{matrix}$

Note that this representation is label independent, i.e. there is noneed to assign each point a label in {1, . . . , k}. This may be viewedas an adjacency matrix, where each cluster is a connected component ofthe graph. This representation can also be converted into arepresentation of soft cluster labeling.

Let labelings

and

have matrix representations C⁽¹⁾ and C⁽²⁾ respectively, so that

$\begin{matrix}{( {L_{1},L_{2}} ) = {\underset{i,j}{\Sigma}C_{ij}^{(1)}C_{ij}^{(2)}}} & (2)\end{matrix}$

This dot product computes the number of pairs of vectors clusteredtogether, and can also be interpreted as the number of common edges ingraphs represented by C⁽¹⁾ and C⁽²⁾. A naïve method for computing thedot product by going over all pairs of points has complexity O(n²).However, it can be computed in linear time.

As a dot product,

,

satisfies the Cauchy-Schwartz inequality:

,

₂

≦

. The correlation score, which is a normalized version of the dotproduct, is:

$\begin{matrix}{{{cor}( {L_{1},L_{2}} )} = \frac{\langle{L_{1},L_{2}}\rangle}{\sqrt{{\langle{L_{1},L_{1}}\rangle}{\langle{L_{2},L_{2}}\rangle}}}} & (3)\end{matrix}$

A. K. Jain and R. C. Dubes, Algorithms for clustering data (PrenticeHall, Englewood Cliffs, N.J., 1988) provide a number of scores forcomparing a labeling produced by a clustering algorithm with a “goldstandard” labeling. This technique is known as “external” validation. Incontrast, the present invention proposes the use of scoring functionsfor “internal” validation that do not require the “gold standard”. Inthe following, it is shown that two commonly used scoring functions canbe expressed in terms of the dot product defined above in Equation 2.Given two matrices C⁽¹⁾, C⁽²⁾ with 0-1 entries, let N_(ij) i,jε{0, 1} bethe number of entries on which C⁽¹⁾ and C⁽²⁾ have values i and j,respectively. Define the matching coefficient as the fraction of entrieson which the two matrices agree:

$\begin{matrix}{{M( {L_{1},L_{2}} )} = {\frac{N_{00} + N_{11}}{N_{00} + N_{01} + N_{10} + N_{11}}.}} & (4)\end{matrix}$

The Jaccard coefficient is the corresponding ratio when “negative”matches are ignored:

$\begin{matrix}{{J( {L_{1},L_{2}} )} = {\frac{N_{11}}{N_{01} + N_{10} + N_{11}}.}} & (5)\end{matrix}$

The Jaccard coefficient is more appropriate when the clusters arerelatively small, since in that case, the N₀₀ term will be the dominantfactor even if the solution is far from the true one. These scores canbe expressed in terms of the labeling dot product and the associatednorm according to the following proposition:

Let C⁽¹⁾ and C⁽²⁾ be the matrix representations of labelings

and

respectively. Then:

${J( {L_{1},L_{2}} )} = \frac{{\text{<}C^{(1)}},{C^{(2)}\text{>}}}{{\text{<}C^{(1)}},{{C^{(1)}\text{>}} + {\text{<}C^{(2)}}},{{C^{(2)}\text{>}} - {\text{<}C^{(1)}}},{C^{(2)}\text{>}}}$${M( {L_{1},L_{2}} )} =  {1 - \frac{1}{n^{2}}}||{C^{(1)} - C^{(2)}} ||^{2}$

This is a result of the observation that N₁₁=

C⁽¹⁾, C⁽²⁾

, N₀₁=

1_(n)−C⁽¹⁾, C⁽²⁾

, N₁₀=

C⁽¹⁾, 1_(n)−C⁽²⁾

, N₀₀=

1_(n)−C⁽¹⁾, C⁽²⁾

, where 1_(n) is an n×n matrix with entries equal to 1. The aboveexpression for the Jaccard coefficient shows that it is close to thecorrelation score.

The following list provides the routine for the model explorer algorithmaccording to the present invention:

Input: A dataset D, k_(min), k_(max) Require: A clustering algorithm,cluster (D, k)   A scoring function for label comparison, s( 

₁, 

₂)   f = 0.8   for k = k_(min) to k_(max) do    for i = 1 to maximumiterations do     sub₁ = subsamp(D, f) {a sub-sample with fraction f ofthe data}     sub₂ = subsamp(D, f)     

 = cluster (sub₁, k)     

 = cluster (sub₂, k)     Intersect = sub_(1 ∩) sub₂     Score (i, k) =s( 

 (Intersect), 

 (Intersect)){Compute the     score on the intersection of the twolabels}    end for   end for

When one looks at a cloud of data points, and at a sub-sample of it fora sampling ratio, f (fraction of points sampled) is not much smallerthan 1 (say f>0.5), one usually observes the same general structure.Thus, it is also reasonable to postulate that a clustering algorithm hascaptured the inherent structure in a dataset if clustering solutionsover different subsamples are similar, e.g., according to one of thesimilarity measures introduced in the previous section. Thus, “inherentstructure” is defined as structure that is stable under sub-sampling (oralternatively, perturbing the data). Given a clustering algorithm and adataset, this translates into a search for the best number of clustersto use in the particular instance. Note that one can also extend thesearch to look for a set of features where structure is apparent,however, in this case, the same set of features is kept.

The inventive clustering algorithm receives as input a dataset (orsimilarity/dissimilarity matrix) and a parameter k that controls eitherdirectly or indirectly the number of clusters that the algorithmproduces. This convention is applicable to hierarchical clusteringalgorithms as well: given k, the tree is cut so that k clusters areproduced. Next, characterize the stability of the clustering for each k.This is accomplished by producing a set of clusterings of sub-samples ofthe data, and comparing the labels of the intersection of pairs ofclusterings using, for example, the correlation similarity measure. Thisis performed for increasing values of k (see above for details). Thedistribution of the scores for the different values of k is thencompared (see, e.g., FIGS. 1 a-1 c). The idea is that when there isstructure in the data that is well described by the clustering algorithmwith that particular value of k, many sub-samples will produce similarclusterings, and their pairwise similarity score will be concentratedclose to 1.

Each sub-sample contains a fixed fraction of the data, f. The actualsubsampling can be implemented in various ways:

1. Select each sample independently so that the size of the intersectionbetween two samples is random.

2. Select together pairs of samples by first selecting theirintersection, then selecting the rest of the data to complete thefraction f.

3. Fix one clustering solution to be one produced on the whole dataset.(This third option was used by Levine and Domany (supra) to give afigure of merit to a particular clustering solution.)

For k=1, all clusterings are the same. This also holds for k=n, where nis the number of data points; in this case every point is in a differentcluster. When the number of clusters becomes large so that there is asmall number of points in each cluster, the solution becomes stable. Thevalue of f should not be too low so that there not all clusters arerepresented in a sub-sample. In the Examples provided below, the shapeof the distribution did not depend very much on the specific value of f.Any value between 0.6 and 0.9 worked well.

EXAMPLES

In this section experiments on artificial and real data are described.In all the experiments the distribution of the correlation score isshown. Equivalent results were obtained using other scores as well. Theparameter values f=0.8 and 200 pairs of solutions were compared for eachk. A hierarchical clustering algorithm was used, with the Ward criterionfor merging clusters (see, e.g., Jain and Dubes, supra). Similar resultswere obtained using other hierarchical clustering methods (complete andaverage linkage). The advantage of using hierarchical clustering methodsis that the same set of clusterings can be used for all values of k.

Example 1 Gaussian Data

Referring first to FIGS. 1 a-1 c, FIG. 1 a shows a mixture of fourGaussians. The histograms of the score for varying values of k for thisdata is plotted in FIG. 1 b. Histograms are shown for each value of k inthe range of 2 to 7. Observations regarding the histograms are that atk=2, there is a peak at 1, since almost all the runs discriminatedbetween the two upper and two lower clusters. At k=3, most runsseparated the two lower clusters, and at k=4 most runs found the“correct” clustering as is reflected in the distribution of scores thatis still close to 1.0. At k>4 there is no longer essentially onepreferred solution. There is, in fact, a wide variety of solutions,evidenced by the widening spectrum of the similarities. FIG. 1 c plotsthe cumulative distributions of the correlation score for each k, wherek=2 at the rightmost side of the plot (at peak 1), and k=7 being theleftmost curve.

Example 2 DNA Microarray Data

The next dataset considered was the yeast DNA microarray data of M.Eisen et. al. (“Genetics cluster analysis and display of genome-wideexpression patterns”, Proc. Natl. Acad. Sci. USA, 95: 14863-14868,December 1998.). The data is a matrix which represents the mRNAexpression levels of n genes across a number of experiments. Some of thegenes in the data have known labels according to a functional class.Five functional classes were selected along with genes that belonguniquely to these five functional classes. This yielded a dataset with208 genes, with 79 features (experiments). Data was normalized bysubtracting the mean and dividing by the standard deviation for eachcolumn. This was also performed for the rows, and repeated for thecolumns. At this stage the first three principal components wereextracted. The distribution and histogram of scores is given in FIG. 2 afor k over the range of 2 to 7. The same behavior is observed as seen inthe mixture of four Gaussians data of FIG. 1 a. Between k=5 and k=6,there is a transition from a distribution that has a large componentnear 1, to a wide distribution that is very similar to the distributionon the random data. The clustering solution that was obtained for k=5agreed well with the given labels, with a correlation score of 0.95.

Example 3 Uniformly Distributed Data

The results of a test run on data uniformly distributed on the unit cubeis shown in FIGS. 3 a and 3 b. The distributions are quite similar toeach other, with no change that can be interpreted as a transformationfrom a stable set of solutions to unstable solutions

The preceding examples indicate a simple way for choosing k as the valuewhere there is a transition from a score distribution that isconcentrated near 1 to a wider distribution. This can be quantified,e.g., by an increase in the area under the cumulative distributionfunction or by an increase in

S(K)=P(s>0.90).

The value of 0.9 is arbitrary, but any value close to 1 would work onthe set of examples considered here.

Example 4 Isolet Letter Pronunciation

The next test was run on a portion of the ISOLET (Isolated Letter SpeechRecognition) database created by Ron Cole and Mark Fanty of theDepartment of Computer Science and Engineering, Oregon GraduateInstitute, Beaverton, Oreg. and available from the UCI (University ofCalifornia at Irvine) Repository of Machine Learning Databases. (Thisdata set was generated by having 150 subjects speak the name of eachletter of the alphabet twice, generating 52 training examples from eachspeaker.) This test provides an example of what occurs with there iscluster overlap.

One thousand points of the original dataset, representing the letters“a”, “c”, “d”, “e”, “f” were used. The columns of the data were“whitened”, then the first three principle components were extracted.FIG. 4 a is a plot of the distribution of the first two principlecomponents for each of the selected letters. FIG. 4 b provideshistograms of the correlation score for each k in the range of 2 to 7,and FIG. 4 c is an overlay of the cumulative distribution of thecorrelation score with k=2 towards the far right side of the plot(near 1) and k=10 towards the left hand side of the plot.

Table 1 provides a comparison of the number of clusters identified usingthe inventive method against the number of clusters obtained using othermethods for selecting k. These other methods are described by R.Tibshirani, G. Walther, and T. Hastie in “Estimating the number ofclusters in a dataset via the Gap statistic”, Tech. Report, Departmentof Statistics, Stanford University, 2000; also published in JRSSB 2000,where the Gap Statistic methods is shown to be superior to a number ofother methods.

TABLE 1 Jain Silhouette KL Gap Subsamp 4 Gauss 4 4 9 4 4 (FIG. 1)Microarray 3 2 4 6 5 (FIG. 2) Isolet 2 5 5 — 3 (FIG. 4)

In the case of clustering algorithms whose output depends on the initialcondition, e.g., k-means, a distribution of scores exists even whenconsidering a fixed sub-sample. In such cases, the method produces anindication of how varied the solutions can be for various values of k.It has been observed that for a “good” value of k, a similar transitionoccurs, but is generally “smeared”, since k-means produces widelyvarying solutions. To address this, a version of k-means similar to thatpresented by P. Bradley and U. Fayyad in “Refining initial points fork-means clustering” (in J. Shavlik, editor, Proc. of the 15^(th) Inter.Conf. of Machine Learning (ICML '98), pages 91-99, San Francisco,Calif., 1998. Morgan Kaufmann) was used. That method produces initialconditions that converge to near optimal solutions, and use a fixedinitial condition for each value of k. According to the presentinvention, k-means clustering produces solutions that are highly stablewith respect to sub-sampling. This good result may be due to the globaloptimization criterion, which differs from the local bottom up approachof hierarchical clustering that appears to be less stable tosub-sampling.

Example 5 Gene Expression Data

A data set of 1600 genes with 12 time steps was utilized to illustratethe process undergone by gene expression profiles.

First, the genes were ranked in order of “quality” to pre-select asubset for further analysis. All the genes were ranked according tothree criteria: (1) saliency (the absolute difference between their minand max value; the larger the better); (2) smoothness (a coefficientassessing the smoothness of the profile was computed; the smoother thebetter); and (3) reliability (the average standard deviation of theexperiment replicated for the whole profile).

The ranks of the genes according to these three criteria were then addedto form a combined criterion according to which the genes were rankedagain. The result can be seen in FIGS. 5 a and 5 b, which show geneexpression temporal profiles. The ten best genes are depicted in FIG. 5a, while the ten worst genes are shown in FIG. 5 b according to acombined criterion of saliency, smoothness, and reliability.

The 5% top quality genes according to the above defined combinedcriterion (800 genes) were selected. A kernel clustering algorithm basedon k-means was then run on random subsets of 100 genes among these 800genes. A maximum number of clusters of ten were used, but only five didnot degenerate. The stability of the solutions was verified by runningagain kernel k-means on the resulting cluster centers. The solution wasrobust with respect to increasing the number of genes (doubling, to 1600genes), changing the subset size (to 200 genes) and the maximum numberof cluster (to 20 genes).

FIGS. 6 a and 6 b illustrate the average profiles of the clusters of theeight clustering runs and their groupings into meta-clusters 1-5. Thecluster centers in FIG. 6 a were obtained with multiple runs of k-meansusing random subsets of 100 genes in the top 800 best quality genes.FIG. 6 b shows the average cluster centers for the five clusters. Only asubset of the nine possible profiles that could occur is represented.

According to the present invention, the clustering algorithm is basedon, but is a variation of, the classical k-means algorithm. Thealgorithm operates by the following routine:

-   -   Initialize: Start with a random assignment of class labels to        the patterns.    -   Step 1: Compute cluster centers by averaging class members in        each class.    -   Step 2: Re-assign patterns to the cluster with nearest cluster        center.    -   Iterate step 1 and 2 until the assignment of patterns to classes        remains constant.

This algorithm differs from the classical k-means in that it uses aspecial metric to measure proximity in step 2. It is based on theresiduals of a fit of one profile onto another, using an affinetransformation that is a combination of translation, scaling, androtation (to first order). Such fit is remarkably simple to implementwith a couple of lines of MATLAB® (The MathWorks, Inc., Natick, Mass.)and is quite fast. For example, to fit a profile (a 12 dimensionalvector x2) that goes through zero between time step 5 and 6, ontoanother vector x1, one can write:

xx2=[x2,ones(12,1),[−5;−4;−3;−2;−1;1;2;3;4;5;6;7]];

w=xx2\x1;

x2fit=xx2*w;

residual=mean((x1−x2fit).̂2);

FIGS. 7 a-d depict two profiles and variations on how they can be fitone onto the other or both on their average. This provides anillustration of curve fitting with affine transformations. FIG. 7 ashows the two original profiles P1 and P2. FIG. 7 b shows the profile P1fitted on P2. FIG. 7 c shows profile P2 fitted on P1. FIG. 7 d showsboth profiles fitted to their average.

After some experimentation, the following measure of dissimilarity wasadopted:

-   -   residual0+max(residual1,residual2),        where residual0 is the squared Euclidean distance between x1 and        x2, residual1 is the residual of the fit of x1 onto x2, and        residual2 is the residual of the fit of x2 onto x1. The        rationale behind this choice is that a dissimilarity simply        based on the symmetric fit to the average is, in some cases, too        optimistic—it becomes very easy with the type of affine        transformations that are being allowed to fit any curve to a        straight line (but not vice versa). Residual 0 is added to avoid        allowing too large transformations. The same desirable        properties could be achieved by solving an optimization problem        under constraints to limit the range of transformations, but        this would be computationally more expensive.

Any positive monotonic transformation of the dissimilarity does notaffect the algorithm. It should also be noted that in Step 1 of thealgorithm, a simple average of the patterns is used, as opposed to anaverage of the curves fitted to the cluster centers. After iterating,there is a significant distortion of the cluster center profiles, someof which just become straight lines.

FIGS. 8 a-h illustrate the results of one run of the algorithm on asubset of 100 gene expression profiles with a maximum cluster number of10, where expression intensity is plotted with time. The clustering isobtained from the top 800 best quality genes. FIG. 8 a shows theoriginal profiles with random class label assignments. FIGS. 8 c, 8 eand 8 g illustrate fitted profiles of cluster members at successiveiterations. FIGS. 8 b, 8 d, 8 f, and 8 h show the cluster centers atsuccessive iterations.

The present invention provides a method for model selection inclustering. Most other approaches in the prior art are based on eitherthe sum squared distances within clusters, or some combination ofwithin-cluster distances and between-cluster distances. Not only dothese prior art methods impose the notion of how a cluster should look(e.g., “compact”), most of the methods performed poorly in practice. Theinventive method, on the other hand, provides a description of thestability of a clustering method with respect to re-sampling or noise.It is believed that this stability captures the notion of “inherent”structure that is the goal in the validating clustering solutions.

Another advantage of present invention is that it is useful inidentifying the absence of structure in the data. This differs from mostclustering validation algorithms of the prior art, with the exception ofthe gap statistic, which can only determine the number of clusters ifthat number is larger than 1.

1. A non-transitory machine-readable medium comprising a plurality ofinstructions that in response to being executed result in a computingsystem executing a process for unsupervised classification of data in adataset according to pattern similarities, comprising: a clusteringfunction to randomly assign k class labels to the data and partition thedataset in k subsets, wherein k has a range with a minimum number oftwo; a compare function to, for each value of k beginning with theminimum value of k for each pair of subsets, compute a correlation scoreon the intersection between the pair of subsets, wherein the correlationscore comprises a similarity measure between the pair of subsets and thegreatest similarity has the highest score; and a histogram function togenerate a distribution of the correlation scores for each value of k,wherein the distribution comprising the highest value of k that remainsconcentrated near the highest correlation score corresponds to anoptimal granularity level for clustering of the data according topattern similarities within the dataset, and to generate an outputcorresponding to clustered data for display or storage in a storagemedium.
 2. The non-transitory machine-readable medium of claim 1,wherein the clustering function comprises a k-means algorithm.
 3. Thenon-transitory machine-readable medium of claim 1, wherein theclustering function comprises a hierarchical clustering algorithm. 4.The non-transitory machine-readable medium of claim 1, wherein theoptimal granularity level comprises a number of clusters for which thereis a transition from a distribution that is peaked near a correlationscore of “1” to a distribution that corresponds to random data.
 5. Thenon-transitory machine-readable medium of claim 1, wherein thesimilarity measure is obtained by measuring a residual of a fit of onecluster onto another cluster, wherein the residual fit comprises using afit that is invariant with respect to affine transformations, whereinthe affine transformations comprise a combination of translation,scaling and rotation.
 6. The non-transitory machine-readable medium ofclaim 1, wherein the dataset comprises gene expression data and thepattern similarities comprise co-regulation patterns.
 7. Anon-transitory machine-readable medium comprising a plurality ofinstructions that in response to being executed result in a computingsystem separating classes within a dataset without supervision, whereinthe computing system: selects a plurality of granularity levels k, andfor each granularity level k: (a) induces perturbations in the datasetto generate a modified dataset; (b) applies a clustering algorithm tothe at least one modified dataset to produce k clusters under each ofthe perturbations; (c) creates a data subset comprising the clustersidentified in step (b); (d) applies the clustering algorithm to the datasubset using the same value of k clusters; (e) determines the stabilityof the clusterings at each granularity level k by measuringdissimilarity between data in the data subset and a cluster center forthe cluster into which the data was assigned; measures fit of the datato the cluster centers for all k granularity levels, wherein the fitcomprises using a fit that is invariant with respect to affinetransformations, wherein the affine transformations comprise acombination of translation, scaling and rotation; selects from among theplurality of granularity levels an optimum granularity level kcorresponding to the best fit; generates an output comprising thedataset clustered into a plurality of subsets corresponding to theoptimal granularity level k; and displays a graph showing the datasetclustered into the plurality of subsets, wherein the subsets correspondto classes.
 8. The non-transitory machine-readable medium of claim 7,wherein the perturbations comprise a combination of one or more ofsub-sampling the dataset, changing initialization of the clusteringalgorithm, and adding noise to the dataset.
 9. The non-transitorymachine-readable medium of claim 7, further comprising, prior toselecting a plurality of granularity levels, ranking the data accordingto one or more quality criteria and selecting a pre-determined fractionof top ranked data.
 10. The non-transitory machine-readable medium ofclaim 7, wherein the clustering algorithm is a k-means algorithm. 11.The non-transitory machine-readable medium of claim 7, wherein theclustering algorithm if hierarchical clustering.
 12. The non-transitorymachine-readable medium of claim 7, wherein the optimal granularitylevel comprises a number of clusters for which there is a transitionfrom a distribution that is peaked near a correlation score of “1” to adistribution that corresponds to random data.
 13. The non-transitorymachine-readable medium of claim 7, wherein the similarity is obtainedby measuring a residual of a fit of one cluster onto another cluster,wherein the residual fit comprises using a fit that is invariant withrespect to affine transformations, wherein the affine transformationscomprise a combination of translation, scaling and rotation.
 14. Anon-transitory machine-readable medium comprising a plurality ofinstructions that in response to being executed result in a computingsystem determining without supervision an optimal number of classeswithin a dataset, wherein each class has pattern similarities, whereinthe computing system: selects a range of granularity levels k, the rangehaving a minimum of two; applies a clustering algorithm to the datasetfor that k clusters are produced; for each value of k, selects aplurality of subsets of the dataset, wherein each sub-sample comprises afixed fraction of the dataset; selects a plurality of pairs of subsets;calculates a similarity between each pair of the plurality of pairs ofsubsets; determines a distribution of the similarities within theplurality of pairs of subsets; compares distributions of thesimilarities for all k granularity levels; selects from among theplurality of granularity levels an optimal granularity level k whosedistribution of the similarities has the highest correlation score amongcorrelation scores for all distributions of similarities; generates anoutput comprising the dataset clustered into an optimal number ofsubsets corresponding to the optimal k; and displays a graph of theclustered dataset showing pattern similarities according to the optimalnumber of subsets.
 15. The non-transitory machine-readable medium ofclaim 14, wherein the clustering algorithm is a k-means algorithm. 16.The non-transitory machine-readable medium of claim 14, wherein theclustering algorithm if hierarchical clustering.
 17. The non-transitorymachine-readable medium of claim 14, wherein the optimal granularitylevel comprises a number of clusters for which there is a transitionfrom a distribution that is peaked near a correlation score of “1” to adistribution that corresponds to random data.
 18. The non-transitorymachine-readable medium of claim 14, wherein the similarity is obtainedby measuring a residual of a fit of one cluster onto another cluster,wherein the residual fit comprises using a fit that is invariant withrespect to affine transformations, wherein the affine transformationscomprise a combination of translation, scaling and rotation.